In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.
Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
Describe the size and the variables in the moneyball training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a check list of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas. a. Mean / Standard Deviation / Median b. Bar Chart or Box Plot of the data c. Is the data correlated to the target variable (or to other variables?) d. Are any of the variables missing and need to be imputed “fixed”?
Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations. a. Fix missing values (maybe with a Mean or Median value) b. Create flags to suggest if a variable was missing c. Transform data by putting it into buckets d. Mathematical transforms such as log or square root (or use Box-Cox) e. Combine variables (such as ratios or adding or multiplying) to create new variables
Using the training data set, build at least three different multiple linear regression models, using different variables (or the same variables with different transformations). Since we have not yet covered automated variable selection methods, you should select the variables manually (unless you previously learned Forward or Stepwise selection, etc.). Since you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done. Discuss the coefficients in the models, do they make sense? For example, if a team hits a lot of Home Runs, it would be reasonably expected that such a team would win more games. However, if the coefficient is negative (suggesting that the team would lose more games), then that needs to be discussed. Are you keeping the model even though it is counter intuitive? Why? The boss needs to know.
Decide on the criteria for selecting the best multiple linear regression model. Will you select a model with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model. For the multiple linear regression model, will you use a metric such as Adjusted R2, RMSE, etc.? Be sure to explain how you can make inferences from the model, discuss multi-collinearity issues (if any), and discuss other relevant model output. Using the training data set, evaluate the multiple linear regression model based on: * (a) mean squared error * (b) R2 * (c) F-statistic * (d) residual plots
Make predictions using the evaluation data set.
The training data set received had 2276 rows of data, with no duplicate rows. There were many variables within the initial training data set, some of which were redundant or aggregate versions of other variables. After inspection the dependent variable that we want to inspect, and subsequently predict is TARGET_WINS. WIth the exception of the INDEX column, the rest are the independent X variables. Below is a break down of all of the non-index columns and their respective Null value counts, along with the range of their values. Note: All of the columns were continuous integers.
| Column Name | Data Type | Nulls | Range | Notes |
|---|---|---|---|---|
| TARGET_WINS | <int> |
No Nulls | 0 - 146 | DEPENDENT_VAR |
| TEAM_BATTING_H | <int> |
No Nulls | 891 - 2,554 | |
| TEAM_BATTING_2B | <int> |
No Nulls | 69 - 458 | |
| TEAM_BATTING_3B | <int> |
No Nulls | 0 - 223 | |
| TEAM_BATTING_HR | <int> |
No Nulls | 0 - 264 | |
| TEAM_BATTING_BB | <int> |
No Nulls | 0 - 878 | |
| TEAM_BATTING_SO | <int> |
102 Nulls | 0 - 1,399 | |
| TEAM_BASERUN_SB | <int> |
131 Nulls | 0 - 697 | |
| TEAM_BASERUN_CS | <int> |
772 Nulls | 0 - 201 | |
| TEAM_BATTING_HBP | <int> |
2085 Nulls | 29 - 95 | |
| TEAM_PITCHING_H | <int> |
No Nulls | 1,137 - 30,132 | |
| TEAM_PITCHING_HR | <int> |
No Nulls | 0 - 343 | |
| TEAM_PITCHING_BB | <int> |
No Nulls | 0 - 3,645 | |
| TEAM_PITCHING_SO | <int> |
102 Nulls | 0 - 19,278 | |
| TEAM_FIELDING_E | <int> |
No Nulls | 65 - 1,898 | |
| TEAM_FIELDING_DP | <int> |
286 Nulls | 52 - 228 |
Of these columns there are several that have null values in need of resolution. After confirming that null values were not just “zeros in disguise”, the scale of the presence of nulls in the data was examined, only 8 percent of rows from the original 2,276 rows have no nulls. Leading us to conclude strategies for accommodating these null columns will be necessary. Ultimately, several different techniques were leveraged. In addition to these techniques there were one or two other column that were parsed out of larger “aggregate” type columns. Lastly, for a brief overview on all of the initial variable columns and their relationship to one another for those rows without nulls please see the chart below:
GGPairs Table For Training Data
The main takeaways from this glimpse of the raw data with nulls removed were:
Several Variables have noticeable Positive relationships and noticable negative relationships with the TARGET_WINS Variable.
TEAM_BATTING_H and TEAM_PITCHING_H have a direct linear relationship, not independent
TEAM_BATTING_SO and TEAM_PITCHING_SO have a direct linear relationship, not independent.
TEAM_BATTING_BB and TEAM_PITCHING_BB have a direct linear relationship, not independent.
Positive relationship between TEAM_BASERUN_SB and TEAM_BASERUN_CS, may want to merge these to one var.
After this initial evaluation, the data cleaning and processing was carried out.
There were multiple steps taken to process the training data and prepare it for modeling. Firstly, as mentioned, the columns that contained null values needed to be addressed. Please see the image below for a coverage map for the presence of nulls in the training data. Each column with nulls was examined and addressed with a specific solution. These solutions were as follows:
Null Values Visualized with MICE Package
Beyond imputation and accommodating nulls there were other aspects of the data that were edited, this varied by model and the concept behind them. These tactics, because they were specific for the models are discussed in the next section.
There were multiple modeling concepts conceived and tested before ulimately landing on one to use for TARGET_WINS prediction. The models that were devised, and their respcective data alterations are as follows:
This model sought to examine the types of batting gains with respect to base number in order to see if there was any significance in variance explaination for TARGET_WINS of a team. It parsed out TEAM_BATTING_1B from the larger aggregate TEAM_BATTING_H, and modeled the relevant columns.The adjusted R^2 here was around 20% of the variance being explained by these X variables with a decent F-score of 145. P-values show statistical significance. The coefficents here were highest for 3 base hits(0.15), and homeruns (0.11). THis was interesting because one would think Home Runs would have the highest coefficient for impact on wins.
CUSTOM DATA PROCESSING
INITIAL MODEL PERFORMANCE
MODEL 1 SUMMARY
This model sought to examine the influences of defense and offensive actions on the chances of winning the game. These indices were used see if there was any significance in variance explanation for TARGET_WINS of a team. This model was subsequently selected to fine tune and improve because it was the best performing model. The Adjusted R^2 for this model was ~21.3 percent of the variance explained by these two aggregate indices.P-values show statistical significance. The coefficients are pretty low though, with the defense indec being higher than the offense index.
CUSTOM DATA PROCESSING
INITIAL MODEL PERFORMANCE
MODEL 2 SUMMARY
SUBMODELS WITH NUANCE
Within this model, the columns that fed each index were also used in two separate models in order to get a sense of how each influenced the model. Within the main model the granualar models I ran to see the nuanced break down for each column were interesting. The Defensive Granular model was not relevant, as the coefficients were super small. although it should be noted they were negative. However, the TEAM_FIELDING_DP was not statistically significant. Pivoting to the Offensive Granualr Model, this one has all significant p-scores as well as a decent adjusted R^2 of 0.225. The F-score was higher at 222. The coefficients were still small, with TEAM_BATTING_H having the highest by just a bit.
MODEL 2 DEFENSIVE COLUMNS GRANULAR SUMMARY
MODEL 2 OFFENSIVE COLUMNS GRANULAR SUMMARY
This model sought to examine the influences of high risk and low risk actions on the chances of winning the game. High risk actions were multiple base hits, excluding home runs, while low risk behavior included single base gain hits, walking base gains, and batting home runs. This model had statistically significant p=values, and an adjusted R^2 of 0.20. The F-score is relatively higher compared to the other models too. While both coefficients are low, the Low Risk actions had a slightly higher one at 0.039, while the High Risk actions had 0.018. This may make sense depending on how you look at the game, and high risk may be higher reward, but more lower risk activites may help win the game without as much risk.
CUSTOM DATA PROCESSING
INITIAL MODEL PERFORMANCE
MODEL 3 SUMMARY
SUBMODELS WITH NUANCE
Within this model, the columns that fed each index were also used in two separate models in order to get a sense of how each influenced the model. Both of the granular models yielded worse R^2 values than the initial Model 3 performance. All the p-values in these 2 models were statistically significant, but the coefficents are all still pretty low. The highest coefficient in both of these models was that for the TEAM_BATTING_2B (0.10) in the high risk activities model.
MODEL 3 HIGH RISK GRANULAR SUMMARY
MODEL 3 LOW RISK GRANULAR SUMMARY
This model sought to look at the differences in defensive type actions on the TARGET_WINS. Defensive Field activities such as double plays and errors made when a team is in the field. Defensive Pitch activities took a look at the Pitcher’s strike outs, as well as the pitcher’s walks and hits allowed. The adjusted R^2 for this main model was super low at ~2% as was the F-score of 32.
CUSTOM DATA PROCESSING
INITIAL MODEL PERFORMANCE
MODEL 4 SUMMARY
SUBMODELS WITH NUANCE
Within this model, the columns that fed each index were also used in two
separate models in order to get a sense of how each influenced the
model. The granular models performed similarly to the main MODEL 4, so
it isnt worth digging in further.
MODEL 4 PITCH WORK GRANULAR SUMMARY
MODEL 4 FIELD WORK GRANULAR SUMMARY
Of all the models that were initially thought of and executed Model 2 was the best performing, so it was ultimately chosen for refinement and improvement.
After selecting MODEL 2 due to its relatively better performance, the negative valued in the Defense index needed to be corrected before attempting to use Box-Cox to identify a proper lambda value to help fit the dependent variable. This was done by shifting the Defense index to be calulated by dividing the sum of TEAM_PITCHING_SO and TEAM_FIELDING_DP by the TEAM_FIELDING_E column. This yeilded positive values for this index. Additionally, for both the Defensive and Offensive indicies multiple transformations were applied with the results of each examined in order to determine the best transformation. These results can be seen in the table below. Lastly, because there were zeros in the TARGET_WINS column, this column was altered to have 1 added to it.
| Transformation | R² (%) | F-score |
|---|---|---|
| No Change (X Vars) | 21.7 | 317 |
| Log Transform | 22.75 | 335 |
| Square Root (SqRt) | 22.43 | 329 |
| Squared | 21.17 | 317 |
| Cubed | 21.77 | 317 |
These transformations were applied to the data, and Model 2 was run again with updated results. You can see below that the transformed model has normally distributed residuals on the histgram, and mostly normal QQ plot as well. However, there is some slight deviation at the ends on the QQ line plot.
TRANSFORMED MODEL 2 MODEL SUMMARY
TRANSFORMED MODEL 2 HISTOGRAM PLOT
TRANSFORMED MODEL 2 QQ PLOT
Following these transformations, Box-Cox was used on the best performing
transformed model acquire a lambda value for TARGET_WINS. The optimal
lambda value was ~1.27. The lambda value was applied to the TARGET_WINS
column, and the model was run.
TRANSFORMED MODEL 2 BOXCOX MODEL SUMMARY
TRANSFORMED MODEL 2 BOXCOX HISTOGRAM PLOT
TRANSFORMED MODEL 2 BOXCOX QQ PLOT
Overall, the Box Cox lambda application seems to have worsened the model
based on the R^2 and F-Score values, as well as the skewness of it’s
residuals, going from a -0.012 with the transformed model to a 0.106
with the Box-Cox lambda application. The MODEL 2 Transformed with logs
and the adjusted Defensive index was the best model of those tried. With
that being said this model still only explained around 21.8 percent of
the variance in the TARGET_WINS column. The F-Score was fairly high,
helping to support the legitimacy of this modal. The coefficients in
this model indicate that the Offensive Index is more important for the
TARGET_WIN prediction than the Deffensive index, as their respective
coefficients are 239,5 and 3.72. If there was another variable similar
to TEAM_FIELD_E for the offensive side of things to divide by in the
Offensive index i wonder how this would shift the model.
Finally, This model was then used to predict TARGET_WIN values in the moneyball_eval_data, specifically after the identical restructuring and processing was completed on that dataset. See below (commented out for knitting):
#predictions <- predict(model2_2, newdata=moneyball_eval_formatted) # Predict on TestData
# JOINING WITH NON-INDEX COLUMN DF, the original imputed data with imputation flags ("M" - MIssing | "P" - Present) for any potentatial recreation.
#predicted_win_data <- cbind(moneyball_eval_imputed, PRED_TARGET_WINS = predictions)
#head(predicted_win_data)
#write.csv(predicted_win_data, "TARGET_WINS_Prediction_moneyball_data.csv")
The appendix below contains all of my code, most of it relavant to my thought process and some of it unused scrap.
#Reading in both data sets, However only inspecting the Training Data for now because that is what we want to work with.
moneyball_train_data = read.csv(url("https://raw.githubusercontent.com/jhnboyy/CUNY_SPS_WORK/refs/heads/main/Spring2025/DATA621/data/moneyball-training-data.csv"))
moneyball_eval_data = read.csv(url("https://raw.githubusercontent.com/jhnboyy/CUNY_SPS_WORK/refs/heads/main/Spring2025/DATA621/data/moneyball-evaluation-data.csv"))
# Taking a look at the first five rows, along with the columns and their respective data types.
print(head(moneyball_train_data))
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 1 1 39 1445 194 39
## 2 2 70 1339 219 22
## 3 3 86 1377 232 35
## 4 4 70 1387 209 38
## 5 5 82 1297 186 27
## 6 6 75 1279 200 36
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 1 13 143 842 NA
## 2 190 685 1075 37
## 3 137 602 917 46
## 4 96 451 922 43
## 5 102 472 920 49
## 6 92 443 973 107
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## 1 NA NA 9364 84
## 2 28 NA 1347 191
## 3 27 NA 1377 137
## 4 30 NA 1396 97
## 5 39 NA 1297 102
## 6 59 NA 1279 92
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 1 927 5456 1011 NA
## 2 689 1082 193 155
## 3 602 917 175 153
## 4 454 928 164 156
## 5 472 920 138 168
## 6 443 973 123 149
## Confirming distinct values in df before count and summary
print(nrow(moneyball_train_data)) # 2276
## [1] 2276
moneyball_train_data <- moneyball_train_data |> distinct()
print(nrow(moneyball_train_data)) # no Dupes, 2276
## [1] 2276
### Summary Stats of the training data.
print(summary(moneyball_train_data))
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
##
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0
## Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
## Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0
## Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
## NA's :102
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137
## 1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419
## Median :101.0 Median : 49.0 Median :58.00 Median : 1518
## Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779
## 3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682
## Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132
## NA's :131 NA's :772 NA's :2085
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
## 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0
## Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0
## Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
## 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2
## Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
## NA's :102
## TEAM_FIELDING_DP
## Min. : 52.0
## 1st Qu.:131.0
## Median :149.0
## Mean :146.4
## 3rd Qu.:164.0
## Max. :228.0
## NA's :286
## Overview Non-Index Columns <datatypes>: Notes
#TARGET_WINS <int>: No nulls, Range: 0 - 146
#TEAM_BATTING_H <int>: No Nulls, 891 - 2,554
#TEAM_BATTING_2B <int>: No Nulls, 69 - 458
#TEAM_BATTING_3B <int>: No Nulls, 0 - 223
#TEAM_BATTING_HR <int>: No Nulls, 0 - 264
#TEAM_BATTING_BB <int>: No Nulls, 0 - 878
#TEAM_BATTING_SO <int>: 102 Nulls, 0 - 1,399
#TEAM_BASERUN_SB <int>: 131 Nulls, 0 - 697
#TEAM_BASERUN_CS <int>: 772 Nulls, 0 - 201
#TEAM_BATTING_HBP <int>: 2085 Nulls, 29 - 95
#TEAM_PITCHING_H <int>: No Nulls, 1137 - 30,132
#TEAM_PITCHING_HR <int>: No Nulls, 0 - 343
#TEAM_PITCHING_BB <int>: No Nulls, 0 - 3645
#TEAM_PITCHING_SO <int>: 102 Nulls, 0 - 19278
#TEAM_FIELDING_E <int>: No Nulls, 65 - 1898
#TEAM_FIELDING_DP <int>: 286 Nulls, 52 - 228
## TARGET_WINS is the dependent var.
## The other various predictors would be independent vars.
## May need to use imputation for some of these because of nulls.
## Based on Description Table the following relationships may be true:
## TEAM_BATTING_H may be somewhat an aggregate of TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR
## - The sum of the latter three variables minus the TEAM_BATTING_H should be "TEAM_BATTING_1B"
## - Confirming my Understanding of this:
temp_hit_df <- moneyball_train_data |>
dplyr::select(TEAM_BATTING_H,TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR) |>
mutate("TEAM_BATTING_1B" = TEAM_BATTING_H - (TEAM_BATTING_2B+TEAM_BATTING_3B+TEAM_BATTING_HR))
print(head(temp_hit_df))
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 1 1445 194 39 13
## 2 1339 219 22 190
## 3 1377 232 35 137
## 4 1387 209 38 96
## 5 1297 186 27 102
## 6 1279 200 36 92
## TEAM_BATTING_1B
## 1 1199
## 2 908
## 3 973
## 4 1044
## 5 982
## 6 951
## Are Errors (TEAM_FIELDING_E) and aggregate of the "allowed" categories: TEAM_PITCHING_BB,TEAM_PITCHING_H,TEAM_PITCHING_HR
temp_error_df <- moneyball_train_data |>
dplyr::select(TEAM_FIELDING_E,TEAM_PITCHING_BB,TEAM_PITCHING_H,TEAM_PITCHING_HR) |>
mutate("Allowed_Sum" = TEAM_PITCHING_BB+TEAM_PITCHING_H+TEAM_PITCHING_HR,
"equiv_test" = TEAM_FIELDING_E == Allowed_Sum)
print(head(temp_error_df))
## TEAM_FIELDING_E TEAM_PITCHING_BB TEAM_PITCHING_H TEAM_PITCHING_HR Allowed_Sum
## 1 1011 927 9364 84 10375
## 2 193 689 1347 191 2227
## 3 175 602 1377 137 2116
## 4 164 454 1396 97 1947
## 5 138 472 1297 102 1871
## 6 123 443 1279 92 1814
## equiv_test
## 1 FALSE
## 2 FALSE
## 3 FALSE
## 4 FALSE
## 5 FALSE
## 6 FALSE
## This is not the case, there are no instances where this is true.
temp_error_df |>
filter(equiv_test==TRUE)
## [1] TEAM_FIELDING_E TEAM_PITCHING_BB TEAM_PITCHING_H TEAM_PITCHING_HR
## [5] Allowed_Sum equiv_test
## <0 rows> (or 0-length row.names)
## According to the descriptive document for this homework, these stats are seasonal perforance for a teams over time (1871 - 2006)
## Curious how many teams are in this data, although there is no way to parse them out.
total_years <- 2006-1871
print(total_years)
## [1] 135
# Combining both aspects of the data because this is the same data but split for this task:
total_rows <- nrow(moneyball_train_data)+nrow(moneyball_eval_data)
## Total teams covered:
print(total_rows/total_years) ## About 18 - 19 teams.
## [1] 18.77778
## ABOVE TEAM WORK NOT APPLICABLE AFTER REREADING THE DESCROPTION OF THE DATA
# Looking at nulls, For those columns with Nulls See what the other columns look like. Are they the same rows? Are they different rows?
## --- TEAM_BATTING_SO (Batting Strike out nulls)
so_nulls <- moneyball_train_data |> filter(is.na(TEAM_BATTING_SO))
#so_nulls
## Seems that TEAM_BASERUN_CS,TEAM_BATTING_HBP, and TEAM_PITCHING_SO are null in the same instances as the the strike out nulls. Lets Check,
print(unique(so_nulls$TEAM_BASERUN_CS))
## [1] NA
print(unique(so_nulls$TEAM_BATTING_HBP))
## [1] NA
print(unique(so_nulls$TEAM_PITCHING_SO))
## [1] NA
## Confirmed all of these values are null when the Pitching Strike outs are null
## Are the nulls here just zeros? Lets see if zeros exist anywhere in their respective non null values.
## Using Summary again:
print(summary(moneyball_train_data))
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
##
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0
## Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
## Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0
## Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
## NA's :102
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137
## 1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419
## Median :101.0 Median : 49.0 Median :58.00 Median : 1518
## Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779
## 3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682
## Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132
## NA's :131 NA's :772 NA's :2085
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
## 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0
## Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0
## Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
## 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2
## Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
## NA's :102
## TEAM_FIELDING_DP
## Min. : 52.0
## 1st Qu.:131.0
## Median :149.0
## Mean :146.4
## 3rd Qu.:164.0
## Max. :228.0
## NA's :286
#TEAM_BATTING_SO - Zeros present
##TEAM_BASERUN_SB - Zeros present
#TEAM_BASERUN_CS - Zeros present
#TEAM_BATTING_HBP - No Zeros
#TEAM_PITCHING_SO - Zeros present
#TEAM_FIELDING_DP - No zeros present
## Most of the columns that have nulls also have zeros, so structurally speaking the nulls are most likely not zeros and will need imputation, or we will need to remove the rows.
## What proportion of the data is null, would it be worth removing?
no_nulls<-moneyball_train_data |>
filter(!is.na(TEAM_BATTING_SO),
!is.na(TEAM_BASERUN_SB),
!is.na(TEAM_BASERUN_CS),
!is.na(TEAM_BATTING_HBP),
!is.na(TEAM_PITCHING_SO),
!is.na(TEAM_FIELDING_DP))
#print(paste("Only", round(nrow(no_nulls)/ nrow(moneyball_train_data),2)*100,"percent of rows from the original",nrow(moneyball_train_data),"rows have no nulls, we will need to impute."))
## Taking a look at the data with pair plots
#colnames(moneyball_train_data)
#"TARGET_WINS", "TEAM_BATTING_H" "TEAM_BATTING_2B" "TEAM_BATTING_3B" "TEAM_BATTING_HR" "TEAM_BATTING_BB" "TEAM_BATTING_SO" "TEAM_BASERUN_SB" "TEAM_BASERUN_CS" #"TEAM_BATTING_HBP" "TEAM_PITCHING_H" "TEAM_PITCHING_HR" "TEAM_PITCHING_BB" "TEAM_PITCHING_SO" "TEAM_FIELDING_E" "TEAM_FIELDING_DP"
moneyball_train_nona <- na.omit(moneyball_train_data)
## NOTE THIS IS A VERY SMALL SUBSET BECAUSE OF THE NULLS
print(nrow(moneyball_train_nona)) #191
## [1] 191
p <- ggpairs(moneyball_train_nona,
progress = FALSE) +
theme_minimal(base_size=9) +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
strip.text.x = element_text(angle = 90, hjust = 1),
strip.text.y = element_text(angle = 0, hjust = 1))
## Saving Image for reference in PDF #
#ggsave("ggpairs_plot.png", plot = p, width = 15, height = 20, units = "in", dpi = 300)
### Notesworthy Points from the GGPAIRs plot.
#-- TEAM_BATTING_H and TEAM_PITCHING_H have a direct linear relationship, not independent
#-- TEAM_BATTING_SO and TEAM_PITCHING_SO have a direct linear relationship, not independent.
#-- TEAM_BATTING_BB and TEAM_PITCHING_BB have a direct linear relationship, not independent.
#-- Positive relationship between TEAM_BASERUN_SB and TEAM_BASERUN_CS, may want to merge these to one var.
#-- There may be some other relationships that slip past the visual checks, but will deal with those later.
### Action Items for modeling:
## Should keep just one of each of these variables when modeling for sake of predictor independence.
## Similarly, we should parse out the 1H from the TEAM_BATTING_H or just keep only the agg var for modeling.
## Similar actions needed for TEAM_PITCHING_H and TEAM_PITCHING_HR, these variables are related as HR is a subset of hits.
## TEAM_BASERUN_SB and TEAM_BASERUN_CS aggregated into one Stealing base index, because noticable relationship in plot
## Looking at the columns with nulls.
## Plot work for moneyball_train_data$TEAM_BATTING_SO
ggplot(moneyball_train_data, aes(x = TEAM_BATTING_SO)) +
geom_histogram( bins = 50) +
labs(title = "Histogram for TEAM_BATTING_SO") +
theme_minimal()
## Warning: Removed 102 rows containing non-finite outside the scale range
## (`stat_bin()`).
ggplot(moneyball_train_data, aes(sample = TEAM_BATTING_SO)) +
stat_qq() +
theme_minimal()
## Warning: Removed 102 rows containing non-finite outside the scale range
## (`stat_qq()`).
## Distribution conclusion: Not normal, camel shaped
## Plot work for moneyball_train_data$TEAM_BASERUN_CS
ggplot(moneyball_train_data, aes(x = TEAM_BASERUN_CS)) +
geom_histogram( bins = 50) +
labs(title = "Histogram for TEAM_BASERUN_CS") +
theme_minimal()
## Warning: Removed 772 rows containing non-finite outside the scale range
## (`stat_bin()`).
ggplot(moneyball_train_data, aes(sample = TEAM_BASERUN_CS)) +
stat_qq() +
theme_minimal()
## Warning: Removed 772 rows containing non-finite outside the scale range
## (`stat_qq()`).
## Distribution conclusion: Right Skew in the data, not normal
## Plot work for moneyball_train_data$TEAM_BASERUN_SB
ggplot(moneyball_train_data, aes(x = TEAM_BASERUN_SB)) +
geom_histogram( bins = 50) +
labs(title = "Histogram for TEAM_BASERUN_SB") +
theme_minimal()
## Warning: Removed 131 rows containing non-finite outside the scale range
## (`stat_bin()`).
ggplot(moneyball_train_data, aes(sample = TEAM_BASERUN_SB)) +
stat_qq() +
theme_minimal()
## Warning: Removed 131 rows containing non-finite outside the scale range
## (`stat_qq()`).
## Distribution conclusion: Right Skew in the data, not normal.
## Plot work for moneyball_train_data$TEAM_BATTING_HBP
ggplot(moneyball_train_data, aes(x = TEAM_BATTING_HBP)) +
geom_histogram( bins = 50) +
labs(title = "Histogram for TEAM_BATTING_HBP") +
theme_minimal()
## Warning: Removed 2085 rows containing non-finite outside the scale range
## (`stat_bin()`).
ggplot(moneyball_train_data, aes(sample = TEAM_BATTING_HBP)) +
stat_qq() +
theme_minimal()
## Warning: Removed 2085 rows containing non-finite outside the scale range
## (`stat_qq()`).
## Distribution conclusion: Seems normal. Easily seen when bin number is lowered, qq plot mostly comfirms
## Plot work for moneyball_train_data$TEAM_PITCHING_SO
ggplot(moneyball_train_data, aes(x = 1+log(TEAM_PITCHING_SO))) +
geom_histogram( bins = 50) +
labs(title = "Histogram for TEAM_PITCHING_SO") +
theme_minimal()
## Warning: Removed 122 rows containing non-finite outside the scale range
## (`stat_bin()`).
ggplot(moneyball_train_data, aes(sample = TEAM_PITCHING_SO)) +
stat_qq() +
theme_minimal()
## Warning: Removed 102 rows containing non-finite outside the scale range
## (`stat_qq()`).
## Distribution conclusion: Seems normal, with a super high outlier. May need to deal with this
summary(moneyball_train_data$TEAM_PITCHING_SO)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 615.0 813.5 817.7 968.0 19278.0 102
sort(unique(moneyball_train_data$TEAM_PITCHING_SO)) ## Two very large outliers: 12758, 19278
## [1] 0 181 205 208 252 270 272 292 301 306 310 316
## [13] 320 326 329 341 345 358 361 362 363 364 366 367
## [25] 369 373 374 375 376 378 380 381 383 384 386 387
## [37] 390 391 392 393 395 396 398 400 401 402 406 407
## [49] 409 410 411 412 414 416 417 419 420 422 423 426
## [61] 427 431 432 433 434 436 437 438 440 443 444 446
## [73] 447 448 449 450 451 454 455 459 460 461 462 463
## [85] 464 465 466 467 468 469 470 471 473 475 477 478
## [97] 479 480 481 483 484 485 486 488 489 490 491 493
## [109] 494 495 496 497 498 499 500 502 503 504 505 506
## [121] 507 508 510 511 512 513 514 515 516 517 518 519
## [133] 520 521 522 524 525 526 527 528 529 530 531 533
## [145] 534 535 536 537 539 540 541 542 543 545 546 547
## [157] 548 549 551 552 553 554 555 556 557 558 559 560
## [169] 561 562 563 564 565 566 567 568 569 570 571 572
## [181] 573 574 575 576 577 578 579 580 581 582 583 584
## [193] 585 586 587 588 589 590 591 592 593 594 595 596
## [205] 597 598 599 600 601 602 603 604 605 606 607 608
## [217] 609 610 611 612 613 614 615 616 617 619 620 621
## [229] 622 623 624 625 626 627 628 629 630 631 633 634
## [241] 635 636 637 638 639 640 641 642 643 644 645 646
## [253] 647 648 649 650 651 652 653 654 655 656 657 658
## [265] 659 660 661 662 663 664 665 666 667 668 669 670
## [277] 671 672 673 674 675 678 679 680 681 682 683 684
## [289] 685 686 687 688 690 691 692 693 694 695 696 697
## [301] 698 700 702 703 704 705 707 708 709 710 711 712
## [313] 713 714 715 716 717 718 719 720 721 722 723 724
## [325] 725 726 727 728 729 730 731 732 733 734 735 736
## [337] 737 738 739 740 741 742 743 744 745 746 747 749
## [349] 750 751 752 753 754 755 756 757 758 759 760 761
## [361] 762 764 765 766 767 768 769 770 771 772 774 775
## [373] 777 778 779 781 782 784 785 787 788 789 790 791
## [385] 792 793 794 795 796 797 799 800 801 802 803 804
## [397] 805 806 807 808 809 810 811 812 813 814 815 816
## [409] 817 818 819 820 821 822 823 824 825 826 827 828
## [421] 829 830 831 832 833 834 835 836 837 838 839 840
## [433] 841 842 843 844 845 846 847 848 849 850 851 852
## [445] 853 854 855 856 857 858 860 861 862 863 864 865
## [457] 866 867 868 869 870 871 872 873 874 875 876 877
## [469] 879 880 881 882 883 884 885 887 888 889 890 891
## [481] 892 893 896 897 898 899 900 901 902 903 904 905
## [493] 906 907 908 909 910 911 912 913 914 915 916 917
## [505] 918 919 920 921 922 923 924 925 926 927 928 929
## [517] 930 931 932 933 934 935 936 937 938 939 940 941
## [529] 942 943 944 946 947 948 949 950 951 952 953 954
## [541] 955 956 957 958 959 960 961 962 963 964 965 966
## [553] 967 968 969 970 971 972 973 974 975 976 977 978
## [565] 979 980 981 982 983 984 985 986 987 988 989 990
## [577] 992 993 994 995 996 997 998 1000 1001 1003 1004 1005
## [589] 1006 1008 1009 1010 1011 1012 1013 1014 1015 1016 1018 1019
## [601] 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031
## [613] 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043
## [625] 1044 1045 1046 1047 1048 1049 1052 1053 1054 1055 1056 1057
## [637] 1058 1059 1060 1061 1062 1063 1066 1067 1068 1069 1070 1071
## [649] 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083
## [661] 1084 1085 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096
## [673] 1097 1098 1099 1102 1103 1104 1105 1106 1107 1108 1109 1110
## [685] 1111 1112 1113 1114 1116 1117 1119 1120 1121 1122 1123 1124
## [697] 1125 1126 1129 1130 1132 1133 1134 1136 1138 1141 1143 1144
## [709] 1145 1146 1148 1153 1155 1156 1157 1158 1159 1160 1161 1162
## [721] 1164 1167 1168 1169 1170 1172 1173 1176 1179 1180 1181 1185
## [733] 1188 1189 1190 1191 1192 1193 1194 1197 1198 1200 1203 1204
## [745] 1205 1206 1208 1210 1211 1217 1221 1222 1233 1237 1239 1245
## [757] 1247 1248 1249 1253 1256 1257 1258 1261 1269 1273 1275 1276
## [769] 1284 1293 1296 1299 1303 1305 1313 1326 1328 1335 1342 1349
## [781] 1350 1351 1354 1356 1364 1370 1371 1385 1386 1387 1399 1405
## [793] 1412 1419 1424 1427 1430 1434 1436 1459 1461 1464 1474 1481
## [805] 1491 1511 1541 1552 1561 1590 1600 1659 1739 1781 2225 2309
## [817] 2367 2492 3450 4224 5456 12758 19278
## Is this ONLY two rows?
nrow(moneyball_train_data |> filter(TEAM_PITCHING_SO>12000)) # Confirmed only two rows, may just want ot remove these.
## [1] 2
moneyball_train_data |> filter(TEAM_PITCHING_SO>12000)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 1 1494 108 1188 338 0
## 2 2380 41 992 263 20
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 1 0 270 945 NA
## 2 0 142 952 NA
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## 1 NA NA 16038 0
## 2 NA NA 20088 0
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 1 3645 12758 716 NA
## 2 2876 19278 952 NA
## Plot work for moneyball_train_data$TEAM_FIELDING_DP
ggplot(moneyball_train_data, aes(x = TEAM_FIELDING_DP)) +
geom_histogram( bins = 50) +
labs(title = "Histogram for TEAM_FIELDING_DP") +
theme_minimal()
## Warning: Removed 286 rows containing non-finite outside the scale range
## (`stat_bin()`).
ggplot(moneyball_train_data, aes(sample = TEAM_FIELDING_DP)) +
stat_qq() +
theme_minimal()
## Warning: Removed 286 rows containing non-finite outside the scale range
## (`stat_qq()`).
## Distribution conclusion: Seems normal
#, fig.width = 5, fig.height =10}
## Using Mice Native tool to example the pattern of nulls:
##Imputation Reference: https://www.r-bloggers.com/2015/10/imputing-missing-data-with-r-mice-package/
#md.pattern(moneyball_train_data,rotate.names = TRUE)
## Viz Notes:
#- Enough data to impute TEAM_BATTING_SO and TEAM_PITCHING_SO. Need to impute.
#- TEAM_BASERUN_CS and TEAM_BASERUN_SB it may be worth just to agg to a column that is total attempts at base stealing for team, as this strategy's impact is what we want to know.No need to impute
#- Can probably combine the TEAM_BATTING_HBP and TEAM_BATTING_BB as they are essentially equivalents for impact on game. As total batting walks. No need to impute with HBP because way too many nulls.
## Looking into TEAM_FIELDING_DP (286 Nulls vs)
nrow(moneyball_train_data |> filter(!is.na(TEAM_FIELDING_DP)))#1,990 non null rows can impute. ~14% nulls
## [1] 1990
### Takeaways for Imputation Process:
# - I was going to try different methods within MICE, but decided to just keep all of them PMM in order to be consistent for explaining. The TEAM_PITCHING_SO outliers shouild be handled separately though.
# - Overall To Do's:
# --- TEAM_BATTING_SO - Imputation with MICE PMM Needed
# --- TEAM_PITCHING_SO - Imputation with MICE PMM Needed (Probably need to remove the two outliers from the TEAM_PITCHING_SO column before imputation)
# --- TEAM_BASERUN_SB and TEAM_BASERUN_CS to be combined into TEAM_BASERUN_STEAL_ATTEMPTS; No Imputation needed.
# --- TEAM_BATTING_HBP can be combined with TEAM_BATTING_BB (no null values) into TEAM_BATTING_WALK_TOTAL
# --- TEAM_FIELDING_DP -Imputation with MICE PMM Needed
### USING ACTION ITEMS AS GUIDE
## Adding imputation /missing data Flags to keep track of the changes:
## Present (P) and Missing (M)
moneyball_train_data$TEAM_BATTING_SO_FLAG <- ifelse(is.na(moneyball_train_data$TEAM_BATTING_SO), 'M', 'P')
moneyball_train_data$TEAM_PITCHING_SO_FLAG <- ifelse(is.na(moneyball_train_data$TEAM_PITCHING_SO), 'M', 'P')
moneyball_train_data$TEAM_BASERUN_SB_FLAG <- ifelse(is.na(moneyball_train_data$TEAM_BASERUN_SB), 'M', 'P')
moneyball_train_data$TEAM_BASERUN_CS_FLAG <- ifelse(is.na(moneyball_train_data$TEAM_BASERUN_CS), 'M', 'P')
moneyball_train_data$TEAM_BATTING_BB_FLAG <- ifelse(is.na(moneyball_train_data$TEAM_BATTING_BB), 'M', 'P')
moneyball_train_data$TEAM_BATTING_HBP_FLAG <- ifelse(is.na(moneyball_train_data$TEAM_BATTING_HBP), 'M', 'P')
moneyball_train_data$TEAM_FIELDING_DP_FLAG <- ifelse(is.na(moneyball_train_data$TEAM_FIELDING_DP), 'M', 'P')
# --- TEAM_BASERUN_SB and TEAM_BASERUN_CS to be combined into TEAM_BASERUN_STEAL_ATTEMPTS; No Imputation needed for certain models
# --- TEAM_BATTING_HBP can be combined with TEAM_BATTING_BB (no null values) into TEAM_BATTING_WALK_TOTAL
moneyball_train_processed <- moneyball_train_data |>
mutate(TEAM_BASERUN_STEAL_ATTEMPTS = ifelse(is.na(TEAM_BASERUN_SB), 0, TEAM_BASERUN_SB) + ifelse(is.na(TEAM_BASERUN_CS), 0, TEAM_BASERUN_CS),
TEAM_BATTING_WALK_TOTAL = ifelse(is.na(TEAM_BATTING_BB), 0, TEAM_BATTING_BB) + ifelse(is.na(TEAM_BATTING_HBP), 0, TEAM_BATTING_HBP)) |>
dplyr::select(-TEAM_BASERUN_CS, -TEAM_BATTING_HBP) #-TEAM_BASERUN_SB,-TEAM_BATTING_BB
## In these aggregate derivative rows, im considering the nulls to be zero for summation.
# --- TEAM_PITCHING_SO,TEAM_BATTING_SO,TEAM_FIELDING_DP,TEAM_BASERUN_SB,TEAM_BATTING_BB - Imputation with MICE PMM Needed. Probably need to remove the two outlier columns from the TEAM_PITCHING_SO column, and the data at large.
outlier_rows <- moneyball_train_processed |> filter(TEAM_PITCHING_SO>12000)
no_outlier_moneyball_train_processed <- moneyball_train_processed |> filter(TEAM_PITCHING_SO<12000 | is.na(TEAM_PITCHING_SO))
imputed <- mice(no_outlier_moneyball_train_processed,m=5,meth='pmm',seed=5)
##
## iter imp variable
## 1 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 1 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 1 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 1 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 1 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## Warning: Number of logged events: 7
moneyball_train_imputed<-complete(imputed)
#Checking for nulls where there shouldnt be.
print(nrow(moneyball_train_imputed |> filter(is.na(TEAM_PITCHING_SO))))
## [1] 0
## Working DF
print(head(moneyball_train_imputed))
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 1 1 39 1445 194 39
## 2 2 70 1339 219 22
## 3 3 86 1377 232 35
## 4 4 70 1387 209 38
## 5 5 82 1297 186 27
## 6 6 75 1279 200 36
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 1 13 143 842 39
## 2 190 685 1075 37
## 3 137 602 917 46
## 4 96 451 922 43
## 5 102 472 920 49
## 6 92 443 973 107
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## 1 9364 84 927 5456
## 2 1347 191 689 1082
## 3 1377 137 602 917
## 4 1396 97 454 928
## 5 1297 102 472 920
## 6 1279 92 443 973
## TEAM_FIELDING_E TEAM_FIELDING_DP TEAM_BATTING_SO_FLAG TEAM_PITCHING_SO_FLAG
## 1 1011 190 P P
## 2 193 155 P P
## 3 175 153 P P
## 4 164 156 P P
## 5 138 168 P P
## 6 123 149 P P
## TEAM_BASERUN_SB_FLAG TEAM_BASERUN_CS_FLAG TEAM_BATTING_BB_FLAG
## 1 M M P
## 2 P P P
## 3 P P P
## 4 P P P
## 5 P P P
## 6 P P P
## TEAM_BATTING_HBP_FLAG TEAM_FIELDING_DP_FLAG TEAM_BASERUN_STEAL_ATTEMPTS
## 1 M M 0
## 2 M P 65
## 3 M P 73
## 4 M P 73
## 5 M P 88
## 6 M P 166
## TEAM_BATTING_WALK_TOTAL
## 1 143
## 2 685
## 3 602
## 4 451
## 5 472
## 6 443
## Now that the data is imputed, and some of the columns have been aggregated, taking the next step in processing and/or variable preparation and selection for the models.
## Printing this to look at the columns again:
#colnames(moneyball_train_imputed)
## Thoughts
#- Batting vs BaseRun vs Fielding vs Pitching Agg vars -> Offesne vs Defense
#- Parse out the 1HB, se can do a correlation with the type of hits vs total wins.
#- Steals vs caught steals, maybe also this coirrelation with the hits?
#- Need to keep in mind we need to predict winds for the team.
## Three main Model Ideas that will dictate further processing:
# --- MODEL 1: Parse out "TEAM_BATTING_1B" from TEAM_BATTING_H, then can do analysis on the different types of base hits and their correlation with wins.
# --- MODEL 2: Aggregate into Two index Defense and Offense (Pitching and Field VAS Batting and Baserun Metrics) to see which is more important
# --- MODEL 3: High Risk Offensive Model, Looking at if the "higher risk" offensive plays are worth it.
# --- MODEL 4: Pitching Index Model vs Field Index for Defense analysis.
#### Keeping two dfs for each model (except 1), one with the non aggregated values and one with custom agg vars.
### MODEL 1 Processing
# --- Limiting to TARGET_WINS and the parsed out Hit columns
model_1_df <- moneyball_train_imputed |>
dplyr::select(TARGET_WINS,TEAM_BATTING_H,TEAM_BATTING_2B,TEAM_BATTING_3B,TEAM_BATTING_HR)|>
mutate (TEAM_BATTING_1B = (TEAM_BATTING_H - (TEAM_BATTING_2B+TEAM_BATTING_3B+TEAM_BATTING_HR))) |>
dplyr::select(-TEAM_BATTING_H)
print(head(model_1_df))
## TARGET_WINS TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_1B
## 1 39 194 39 13 1199
## 2 70 219 22 190 908
## 3 86 232 35 137 973
## 4 70 209 38 96 1044
## 5 82 186 27 102 982
## 6 75 200 36 92 951
### MODEL 2 Processing
## Grouping Proper Defense and Offense. Ensuring that those non-independent vars (e.g.,TEAM_BATTING_H and TEAM_PITCHING_H) as accounted for.
# --- OTHER INCLUSION:
# TARGET_WINS
# TEAM_BATTING_SO_FLAG
# TEAM_PITCHING_SO_FLAG
# TEAM_BATTING_BB_FLAG
# TEAM_BATTING_HBP_FLAG
# TEAM_FIELDING_DP_FLAG
# TEAM_BASERUN_SB_FLAG
# --- DEFENSE INCLUSION:
#-- TEAM_PITCHING_SO
#-- TEAM_FIELDING_DP
#-- TEAM_FIELDING_E
# --- OFFENSE INCLUSION:
#-- TEAM_BATTING_H
#-- TEAM_BATTING_BB
#-- TEAM_BASERUN_SB
#-- NOTE: In order to account for the non-independent vars on both sides, for each index we are only considering those that positively impact the index from the redundant variables. For instance, from the variables TEAM_PITCHING_SO and TEAM_BATTING_SO, only TEAM_PITCHING_SO will be used because it would positively contribute to a good defense. Howver, variables like TEAM_FIELDING_E that do not have another 1:1 relationship with another column will be used as a factor in defense index.
## Custom Aggregate
model_2_dfa <- moneyball_train_imputed |>
dplyr::select(TARGET_WINS,TEAM_BATTING_SO_FLAG,TEAM_PITCHING_SO_FLAG,TEAM_BATTING_BB_FLAG,TEAM_BATTING_HBP_FLAG,TEAM_FIELDING_DP_FLAG,TEAM_BASERUN_SB_FLAG,
TEAM_PITCHING_SO,TEAM_FIELDING_DP,TEAM_FIELDING_E,TEAM_BATTING_H,TEAM_BATTING_WALK_TOTAL,TEAM_BATTING_BB,TEAM_BASERUN_SB) |>
mutate(DEFENSE_INDEX=((TEAM_PITCHING_SO+TEAM_FIELDING_DP)-TEAM_FIELDING_E),
OFFENSE_INDEX=(TEAM_BATTING_H+TEAM_BATTING_BB+TEAM_BASERUN_SB)) |>
dplyr::select(-TEAM_PITCHING_SO,-TEAM_FIELDING_DP,-TEAM_FIELDING_E,-TEAM_BATTING_H,-TEAM_BATTING_WALK_TOTAL,-TEAM_BATTING_BB,-TEAM_BASERUN_SB)
print(head(model_2_dfa))
## TARGET_WINS TEAM_BATTING_SO_FLAG TEAM_PITCHING_SO_FLAG TEAM_BATTING_BB_FLAG
## 1 39 P P P
## 2 70 P P P
## 3 86 P P P
## 4 70 P P P
## 5 82 P P P
## 6 75 P P P
## TEAM_BATTING_HBP_FLAG TEAM_FIELDING_DP_FLAG TEAM_BASERUN_SB_FLAG
## 1 M M M
## 2 M P P
## 3 M P P
## 4 M P P
## 5 M P P
## 6 M P P
## DEFENSE_INDEX OFFENSE_INDEX
## 1 4635 1627
## 2 1044 2061
## 3 895 2025
## 4 920 1881
## 5 950 1818
## 6 999 1829
## GRANULAR DEFENSE - TEAM_PITCHING_SO TEAM_FIELDING_DP TEAM_FIELDING_E
model_2_df_granular_defense <- moneyball_train_imputed |>
dplyr::select(TARGET_WINS,TEAM_BATTING_SO_FLAG,TEAM_PITCHING_SO_FLAG,TEAM_BATTING_BB_FLAG,TEAM_BATTING_HBP_FLAG,TEAM_FIELDING_DP_FLAG,TEAM_BASERUN_SB_FLAG,
TEAM_PITCHING_SO,TEAM_FIELDING_DP,TEAM_FIELDING_E)
print(head(model_2_df_granular_defense))
## TARGET_WINS TEAM_BATTING_SO_FLAG TEAM_PITCHING_SO_FLAG TEAM_BATTING_BB_FLAG
## 1 39 P P P
## 2 70 P P P
## 3 86 P P P
## 4 70 P P P
## 5 82 P P P
## 6 75 P P P
## TEAM_BATTING_HBP_FLAG TEAM_FIELDING_DP_FLAG TEAM_BASERUN_SB_FLAG
## 1 M M M
## 2 M P P
## 3 M P P
## 4 M P P
## 5 M P P
## 6 M P P
## TEAM_PITCHING_SO TEAM_FIELDING_DP TEAM_FIELDING_E
## 1 5456 190 1011
## 2 1082 155 193
## 3 917 153 175
## 4 928 156 164
## 5 920 168 138
## 6 973 149 123
## GRANULAR OFFENSE - TEAM_BATTING_H TEAM_BATTING_BB TEAM_BASERUN_SB (Keeping TEAM_BATTING_WALK_TOTAL b/c agg of HBP is not really offense, out of team's control)
model_2_df_granualar_offense <- moneyball_train_imputed |>
dplyr::select(TARGET_WINS,TEAM_BATTING_SO_FLAG,TEAM_PITCHING_SO_FLAG,TEAM_BATTING_BB_FLAG,TEAM_BATTING_HBP_FLAG,TEAM_FIELDING_DP_FLAG,TEAM_BASERUN_SB_FLAG,
TEAM_BATTING_H,TEAM_BATTING_BB,TEAM_BASERUN_SB)
### MODEL 3 Risk Index
# --- OTHER INCLUSION:
# TARGET_WINS
# TEAM_BATTING_BB_FLAG
# TEAM_BASERUN_CS_FLAG
# TEAM_BATTING_HBP_FLAG
# TEAM_BASERUN_SB_FLAG
#-- HIGH RISK OFFENSE:
# TEAM_BASERUN_STEAL_ATTEMPTS
# TEAM_BATTING_2B
# TEAM_BATTING_3B
#-- LOW RISK OFFENSE:
# TEAM_BATTING_1B
# TEAM_BATTING_WALK_TOTAL
## Custom Aggregate
model_3_dfa <- moneyball_train_imputed |>
mutate (TEAM_BATTING_1B = (TEAM_BATTING_H - (TEAM_BATTING_2B+TEAM_BATTING_3B+TEAM_BATTING_HR))) |>
dplyr::select( TARGET_WINS,TEAM_BATTING_BB_FLAG,TEAM_BASERUN_CS_FLAG, TEAM_BATTING_HBP_FLAG,TEAM_BASERUN_SB_FLAG,TEAM_BATTING_HR,TEAM_BATTING_1B,
TEAM_BATTING_WALK_TOTAL,TEAM_BATTING_2B,TEAM_BATTING_3B,TEAM_BASERUN_STEAL_ATTEMPTS) |>
mutate (HIGH_RISK = (TEAM_BASERUN_STEAL_ATTEMPTS+TEAM_BATTING_2B+TEAM_BATTING_3B),
LOW_RISK = (TEAM_BATTING_1B + TEAM_BATTING_WALK_TOTAL+TEAM_BATTING_HR)) |>
dplyr::select(-TEAM_BATTING_1B,-TEAM_BATTING_WALK_TOTAL,-TEAM_BATTING_2B,-TEAM_BATTING_3B,-TEAM_BASERUN_STEAL_ATTEMPTS,-TEAM_BATTING_HR)
print(head(model_3_dfa))
## TARGET_WINS TEAM_BATTING_BB_FLAG TEAM_BASERUN_CS_FLAG TEAM_BATTING_HBP_FLAG
## 1 39 P M M
## 2 70 P P M
## 3 86 P P M
## 4 70 P P M
## 5 82 P P M
## 6 75 P P M
## TEAM_BASERUN_SB_FLAG HIGH_RISK LOW_RISK
## 1 M 233 1355
## 2 P 306 1783
## 3 P 340 1712
## 4 P 320 1591
## 5 P 301 1556
## 6 P 402 1486
## High Risk Activities Granular: TEAM_BASERUN_STEAL_ATTEMPTS , TEAM_BATTING_2B, TEAM_BATTING_3B
model_3_df_high_risk_granular <- moneyball_train_imputed |>
mutate (TEAM_BATTING_1B = (TEAM_BATTING_H - (TEAM_BATTING_2B+TEAM_BATTING_3B+TEAM_BATTING_HR))) |>
dplyr::select(TARGET_WINS,TEAM_BATTING_BB_FLAG,TEAM_BASERUN_CS_FLAG, TEAM_BATTING_HBP_FLAG,TEAM_BASERUN_SB_FLAG,
TEAM_BATTING_2B,TEAM_BATTING_3B,TEAM_BASERUN_STEAL_ATTEMPTS)
## Low Risk Activities Granular: TEAM_BATTING_1B, TEAM_BATTING_WALK_TOTAL
model_3_df_low_risk_granular <- moneyball_train_imputed |>
mutate (TEAM_BATTING_1B = (TEAM_BATTING_H - (TEAM_BATTING_2B+TEAM_BATTING_3B+TEAM_BATTING_HR))) |>
dplyr::select( TARGET_WINS,TEAM_BATTING_BB_FLAG,TEAM_BASERUN_CS_FLAG, TEAM_BATTING_HBP_FLAG,TEAM_BASERUN_SB_FLAG,
TEAM_BATTING_1B, TEAM_BATTING_HR, TEAM_BATTING_WALK_TOTAL)
## MODEL 4 Processing
#--DEFENSE Analysis - Which is more important pitching or field play
# --- OTHER INCLUSION:
# TARGET_WINS
# TEAM_PITCHING_SO_FLAG
# TEAM_FIELDING_DP_FLAG
# --- FIELDING INCLUSION:
#TEAM_FIELDING_DP
#TEAM_FIELDING_E
# --- PITCHING INCLUSION:
#TEAM_PITCHING_BB
#TEAM_PITCHING_H (TEAM_PITCHING_HR is included in TEAM_PITCHING_H)
#TEAM_PITCHING_SO
## Custom Aggregate
model_4_dfa <- moneyball_train_imputed |>
dplyr::select(TARGET_WINS,TEAM_PITCHING_SO_FLAG, TEAM_FIELDING_DP_FLAG, TEAM_FIELDING_DP, TEAM_FIELDING_E, TEAM_PITCHING_BB, TEAM_PITCHING_H, TEAM_PITCHING_SO) |>
mutate(FIELD_WORK_INDEX = (TEAM_FIELDING_DP-TEAM_FIELDING_E),
PITCH_WORK_INDEX = (TEAM_PITCHING_SO-(TEAM_PITCHING_BB+TEAM_PITCHING_H)))|>
dplyr::select(-TEAM_FIELDING_DP,-TEAM_FIELDING_E,-TEAM_PITCHING_BB,-TEAM_PITCHING_H,-TEAM_PITCHING_SO)
print(head(model_4_dfa))
## TARGET_WINS TEAM_PITCHING_SO_FLAG TEAM_FIELDING_DP_FLAG FIELD_WORK_INDEX
## 1 39 P M -821
## 2 70 P P -38
## 3 86 P P -22
## 4 70 P P -8
## 5 82 P P 30
## 6 75 P P 26
## PITCH_WORK_INDEX
## 1 -4835
## 2 -954
## 3 -1062
## 4 -922
## 5 -849
## 6 -749
## Defensive Fieldwork Granular
model_4_df_defensive_field_granular <- moneyball_train_imputed |>
dplyr::select(TARGET_WINS,TEAM_PITCHING_SO_FLAG, TEAM_FIELDING_DP_FLAG,
TEAM_FIELDING_DP, TEAM_FIELDING_E)
## Defensive Pitchwork Granular
model_4_df_defensive_pitch_granular<-moneyball_train_imputed |>
dplyr::select(TARGET_WINS,TEAM_PITCHING_SO_FLAG, TEAM_FIELDING_DP_FLAG,
TEAM_PITCHING_SO,TEAM_PITCHING_BB,TEAM_PITCHING_H)
## Running Model with no change
model1 <- lm(TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_3B +TEAM_BATTING_2B + TEAM_BATTING_HR, data=model_1_df)
print(summary(model1))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_3B +
## TEAM_BATTING_2B + TEAM_BATTING_HR, data = model_1_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71.390 -8.854 0.603 9.632 49.361
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.031397 3.188915 5.654 1.76e-08 ***
## TEAM_BATTING_1B 0.032043 0.003092 10.364 < 2e-16 ***
## TEAM_BATTING_3B 0.157431 0.015193 10.362 < 2e-16 ***
## TEAM_BATTING_2B 0.034017 0.007679 4.430 9.89e-06 ***
## TEAM_BATTING_HR 0.114952 0.007682 14.964 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.04 on 2269 degrees of freedom
## Multiple R-squared: 0.2039, Adjusted R-squared: 0.2025
## F-statistic: 145.3 on 4 and 2269 DF, p-value: < 2.2e-16
plot(model1)
#### Overall adjusted r^2 is 20%, with 3b having the highest coefficient. All p values show statistical significance. Lowest coeffeicsnets ate 1B and 2B, may want ot remove and adjust.
##BELOW NOT WORTH IT
# No 1b and 2b; Adjusted R2 goes down, but coefficients increase a bit for 3B
#model1_3plus <- lm(TARGET_WINS ~ TEAM_BATTING_3B + TEAM_BATTING_HR, data=model_1_df)
#print(summary(model1_3plus))
#plot(model1_3plus)
## Custom Agg work (R^2 is ~21%)
model2 <- lm(TARGET_WINS ~ OFFENSE_INDEX + DEFENSE_INDEX, data=model_2_dfa)
print(summary(model2))
##
## Call:
## lm(formula = TARGET_WINS ~ OFFENSE_INDEX + DEFENSE_INDEX, data = model_2_dfa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.424 -8.976 0.481 9.035 57.093
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.0615756 3.0863725 1.640 0.10115
## OFFENSE_INDEX 0.0353272 0.0014213 24.855 < 2e-16 ***
## DEFENSE_INDEX 0.0025885 0.0007198 3.596 0.00033 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.95 on 2271 degrees of freedom
## Multiple R-squared: 0.214, Adjusted R-squared: 0.2133
## F-statistic: 309.2 on 2 and 2271 DF, p-value: < 2.2e-16
plot(model2)
## Granular work (R^2 is ~7%); All coefficients are negative. This model instance may need a transform to see impact. Low F scroe
model2_def_gran <- lm(TARGET_WINS ~ TEAM_PITCHING_SO+ TEAM_FIELDING_DP+ TEAM_FIELDING_E, data=model_2_df_granular_defense)
print(summary(model2_def_gran))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_PITCHING_SO + TEAM_FIELDING_DP +
## TEAM_FIELDING_E, data = model_2_df_granular_defense)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.122 -9.705 0.576 9.630 68.399
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 106.722892 2.286944 46.666 < 2e-16 ***
## TEAM_PITCHING_SO -0.007813 0.001130 -6.914 6.12e-12 ***
## TEAM_FIELDING_DP -0.102159 0.013076 -7.813 8.47e-15 ***
## TEAM_FIELDING_E -0.021707 0.001726 -12.574 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.11 on 2270 degrees of freedom
## Multiple R-squared: 0.07804, Adjusted R-squared: 0.07682
## F-statistic: 64.05 on 3 and 2270 DF, p-value: < 2.2e-16
plot(model2_def_gran)
# R^2 is 22.5% with total hits having highest coefficient (.04) Worth keeping
model2_off_gran <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BASERUN_SB, data=model_2_df_granualar_offense)
print(summary(model2_off_gran))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB +
## TEAM_BASERUN_SB, data = model_2_df_granualar_offense)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.823 -8.763 0.490 9.235 49.663
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.318103 3.299134 -0.703 0.482
## TEAM_BATTING_H 0.043789 0.002029 21.581 < 2e-16 ***
## TEAM_BATTING_BB 0.033815 0.002379 14.214 < 2e-16 ***
## TEAM_BASERUN_SB 0.014734 0.003381 4.358 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.84 on 2270 degrees of freedom
## Multiple R-squared: 0.2268, Adjusted R-squared: 0.2258
## F-statistic: 222 on 3 and 2270 DF, p-value: < 2.2e-16
plot(model2_off_gran)
## Custom Agg (R^2 ~20%) Low coeff.m high f score
model3 <- lm(TARGET_WINS ~ HIGH_RISK + LOW_RISK, data=model_3_dfa)
print(summary(model3))
##
## Call:
## lm(formula = TARGET_WINS ~ HIGH_RISK + LOW_RISK, data = model_3_dfa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.078 -9.081 0.197 9.110 56.827
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.664325 3.207541 2.078 0.0378 *
## HIGH_RISK 0.018733 0.002900 6.459 1.29e-10 ***
## LOW_RISK 0.039120 0.002029 19.277 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.06 on 2271 degrees of freedom
## Multiple R-squared: 0.201, Adjusted R-squared: 0.2002
## F-statistic: 285.6 on 2 and 2271 DF, p-value: < 2.2e-16
plot(model3)
## R^2 is ~12%
model3_gran_high_risk <- lm(TARGET_WINS ~ TEAM_BATTING_2B + TEAM_BATTING_3B +TEAM_BASERUN_STEAL_ATTEMPTS, data=model_3_df_high_risk_granular)
print(summary(model3_gran_high_risk))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_2B + TEAM_BATTING_3B +
## TEAM_BASERUN_STEAL_ATTEMPTS, data = model_3_df_high_risk_granular)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71.784 -9.394 0.655 9.728 54.924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.976139 1.857844 25.824 < 2e-16 ***
## TEAM_BATTING_2B 0.105327 0.006649 15.842 < 2e-16 ***
## TEAM_BATTING_3B 0.082740 0.011532 7.175 9.74e-13 ***
## TEAM_BASERUN_STEAL_ATTEMPTS 0.018618 0.003500 5.320 1.14e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.72 on 2270 degrees of freedom
## Multiple R-squared: 0.1249, Adjusted R-squared: 0.1237
## F-statistic: 108 on 3 and 2270 DF, p-value: < 2.2e-16
plot(model3_gran_high_risk)
## R^2 is 19.8%, a ll low coefficients though, f scroe higher
model3_gran_low_risk <- lm(TARGET_WINS ~ TEAM_BATTING_1B +TEAM_BATTING_WALK_TOTAL+TEAM_BATTING_HR, data=model_3_df_low_risk_granular)
print(summary(model3_gran_low_risk))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_WALK_TOTAL +
## TEAM_BATTING_HR, data = model_3_df_low_risk_granular)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.285 -8.924 0.259 9.346 52.517
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.383255 3.557273 -0.108 0.914
## TEAM_BATTING_1B 0.054282 0.002683 20.229 <2e-16 ***
## TEAM_BATTING_WALK_TOTAL 0.031993 0.002860 11.187 <2e-16 ***
## TEAM_BATTING_HR 0.067185 0.006324 10.624 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.08 on 2270 degrees of freedom
## Multiple R-squared: 0.1996, Adjusted R-squared: 0.1986
## F-statistic: 188.7 on 3 and 2270 DF, p-value: < 2.2e-16
plot(model3_gran_low_risk)
## R^2 is 2.6%, field coefficient is neg probably b/c of TEAM_FIELDING_E. Should transform,
model4 <- lm(TARGET_WINS ~ FIELD_WORK_INDEX + PITCH_WORK_INDEX, data=model_4_dfa)
print(summary(model4))
##
## Call:
## lm(formula = TARGET_WINS ~ FIELD_WORK_INDEX + PITCH_WORK_INDEX,
## data = model_4_dfa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -76.219 -9.996 0.544 10.294 72.538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 81.0242086 0.4966042 163.156 < 2e-16 ***
## FIELD_WORK_INDEX 0.0128692 0.0017484 7.361 2.55e-13 ***
## PITCH_WORK_INDEX -0.0007428 0.0003085 -2.408 0.0161 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.52 on 2271 degrees of freedom
## Multiple R-squared: 0.0275, Adjusted R-squared: 0.02664
## F-statistic: 32.11 on 2 and 2271 DF, p-value: 1.778e-14
plot(model4)
## R^2 is 5.7%
model4_def_field <- lm(TARGET_WINS ~ TEAM_FIELDING_DP+ TEAM_FIELDING_E, data=model_4_df_defensive_field_granular)
print(summary(model4_def_field))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_FIELDING_DP + TEAM_FIELDING_E,
## data = model_4_df_defensive_field_granular)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.906 -9.951 0.547 9.827 73.382
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101.004634 2.153992 46.892 < 2e-16 ***
## TEAM_FIELDING_DP -0.108006 0.013182 -8.193 4.19e-16 ***
## TEAM_FIELDING_E -0.020423 0.001734 -11.779 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.27 on 2271 degrees of freedom
## Multiple R-squared: 0.05863, Adjusted R-squared: 0.0578
## F-statistic: 70.72 on 2 and 2271 DF, p-value: < 2.2e-16
plot(model4_def_field)
## R^2 is 6.6%, neg correlation makes sense
model4_def_pitch <- lm(TARGET_WINS ~ TEAM_PITCHING_SO+TEAM_PITCHING_BB+TEAM_PITCHING_H, data=model_4_df_defensive_pitch_granular)
print(summary(model4_def_pitch))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_PITCHING_SO + TEAM_PITCHING_BB +
## TEAM_PITCHING_H, data = model_4_df_defensive_pitch_granular)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.621 -9.412 0.789 10.052 72.966
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.0080593 1.4307605 55.221 < 2e-16 ***
## TEAM_PITCHING_SO -0.0093176 0.0011475 -8.120 7.56e-16 ***
## TEAM_PITCHING_BB 0.0228060 0.0022950 9.937 < 2e-16 ***
## TEAM_PITCHING_H -0.0018958 0.0002471 -7.673 2.48e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.2 on 2270 degrees of freedom
## Multiple R-squared: 0.06762, Adjusted R-squared: 0.06639
## F-statistic: 54.88 on 3 and 2270 DF, p-value: < 2.2e-16
plot(model4_def_pitch)
### Selecting the best performing models/ most interesting models, and figuring out possible transforms for improvement.
## Model1 is decent, not amazing though
summary(model1)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_3B +
## TEAM_BATTING_2B + TEAM_BATTING_HR, data = model_1_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71.390 -8.854 0.603 9.632 49.361
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.031397 3.188915 5.654 1.76e-08 ***
## TEAM_BATTING_1B 0.032043 0.003092 10.364 < 2e-16 ***
## TEAM_BATTING_3B 0.157431 0.015193 10.362 < 2e-16 ***
## TEAM_BATTING_2B 0.034017 0.007679 4.430 9.89e-06 ***
## TEAM_BATTING_HR 0.114952 0.007682 14.964 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.04 on 2269 degrees of freedom
## Multiple R-squared: 0.2039, Adjusted R-squared: 0.2025
## F-statistic: 145.3 on 4 and 2269 DF, p-value: < 2.2e-16
plot(model1) ## some leveraged points to look at, slight curve in scale location plot (homoscedascity),
#colnames(model_1_df)
## PLOTTING
ggplot(model_1_df, aes(x = TEAM_BATTING_1B, y = TARGET_WINS)) +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
ggplot(model_1_df, aes(x = TEAM_BATTING_2B, y = TARGET_WINS)) +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
ggplot(model_1_df, aes(x = TEAM_BATTING_3B, y = TARGET_WINS)) +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
ggplot(model_1_df, aes(x = TEAM_BATTING_HR, y = TARGET_WINS)) +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
### MODEL 2
summary(model2)
##
## Call:
## lm(formula = TARGET_WINS ~ OFFENSE_INDEX + DEFENSE_INDEX, data = model_2_dfa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.424 -8.976 0.481 9.035 57.093
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.0615756 3.0863725 1.640 0.10115
## OFFENSE_INDEX 0.0353272 0.0014213 24.855 < 2e-16 ***
## DEFENSE_INDEX 0.0025885 0.0007198 3.596 0.00033 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.95 on 2271 degrees of freedom
## Multiple R-squared: 0.214, Adjusted R-squared: 0.2133
## F-statistic: 309.2 on 2 and 2271 DF, p-value: < 2.2e-16
summary(model2_off_gran)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB +
## TEAM_BASERUN_SB, data = model_2_df_granualar_offense)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.823 -8.763 0.490 9.235 49.663
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.318103 3.299134 -0.703 0.482
## TEAM_BATTING_H 0.043789 0.002029 21.581 < 2e-16 ***
## TEAM_BATTING_BB 0.033815 0.002379 14.214 < 2e-16 ***
## TEAM_BASERUN_SB 0.014734 0.003381 4.358 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.84 on 2270 degrees of freedom
## Multiple R-squared: 0.2268, Adjusted R-squared: 0.2258
## F-statistic: 222 on 3 and 2270 DF, p-value: < 2.2e-16
plot(model2)
## PLOTTING
ggplot(model_2_dfa, aes(x = DEFENSE_INDEX, y = TARGET_WINS)) +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
ggplot(model_2_dfa, aes(x = OFFENSE_INDEX, y = TARGET_WINS)) +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Model Two Granular Offense
ggplot(model_2_df_granualar_offense, aes(x = TEAM_BASERUN_SB, y = TARGET_WINS)) +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
ggplot(model_2_df_granualar_offense, aes(x = TEAM_BATTING_BB, y = TARGET_WINS)) +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
ggplot(model_2_df_granualar_offense, aes(x = TEAM_BATTING_H, y = TARGET_WINS)) +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## MODEL 3
summary(model3)
##
## Call:
## lm(formula = TARGET_WINS ~ HIGH_RISK + LOW_RISK, data = model_3_dfa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.078 -9.081 0.197 9.110 56.827
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.664325 3.207541 2.078 0.0378 *
## HIGH_RISK 0.018733 0.002900 6.459 1.29e-10 ***
## LOW_RISK 0.039120 0.002029 19.277 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.06 on 2271 degrees of freedom
## Multiple R-squared: 0.201, Adjusted R-squared: 0.2002
## F-statistic: 285.6 on 2 and 2271 DF, p-value: < 2.2e-16
summary(model3_gran_low_risk)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_WALK_TOTAL +
## TEAM_BATTING_HR, data = model_3_df_low_risk_granular)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.285 -8.924 0.259 9.346 52.517
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.383255 3.557273 -0.108 0.914
## TEAM_BATTING_1B 0.054282 0.002683 20.229 <2e-16 ***
## TEAM_BATTING_WALK_TOTAL 0.031993 0.002860 11.187 <2e-16 ***
## TEAM_BATTING_HR 0.067185 0.006324 10.624 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.08 on 2270 degrees of freedom
## Multiple R-squared: 0.1996, Adjusted R-squared: 0.1986
## F-statistic: 188.7 on 3 and 2270 DF, p-value: < 2.2e-16
plot(model3) # Few leverage points influencing model, should try transform
plot(model3_gran_low_risk)
## PLOTTING
ggplot(model_3_dfa, aes(x = HIGH_RISK, y = TARGET_WINS)) +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
ggplot(model_3_dfa, aes(x = LOW_RISK, y = TARGET_WINS)) +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## --
ggplot(model_3_df_low_risk_granular, aes(x = TEAM_BATTING_1B, y = TARGET_WINS)) +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
ggplot(model_3_df_low_risk_granular, aes(x = TEAM_BATTING_HR, y = TARGET_WINS)) +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
ggplot(model_3_df_low_risk_granular, aes(x = TEAM_BATTING_WALK_TOTAL, y = TARGET_WINS)) +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
### I Want to attempt to perfect model 2 and/or 3, they are decent and i think convey more interesting information than 1.
### MODEL 2 Processing & Improving
## Grouping Proper Defense and Offense. Ensuring that those non-independent vars (e.g.,TEAM_BATTING_H and TEAM_PITCHING_H) as accounted for.
# --- OTHER INCLUSION:
# TARGET_WINS
# TEAM_BATTING_SO_FLAG
# TEAM_PITCHING_SO_FLAG
# TEAM_BATTING_BB_FLAG
# TEAM_BATTING_HBP_FLAG
# TEAM_FIELDING_DP_FLAG
# TEAM_BASERUN_SB_FLAG
# --- DEFENSE INCLUSION:
#-- TEAM_PITCHING_SO
#-- TEAM_FIELDING_DP
#-- TEAM_FIELDING_E
# --- OFFENSE INCLUSION:
#-- TEAM_BATTING_H
#-- TEAM_BATTING_BB
#-- TEAM_BASERUN_SB
#--TEAM_BATTING_SO
plot(model2)
colnames(moneyball_train_imputed)
## [1] "INDEX" "TARGET_WINS"
## [3] "TEAM_BATTING_H" "TEAM_BATTING_2B"
## [5] "TEAM_BATTING_3B" "TEAM_BATTING_HR"
## [7] "TEAM_BATTING_BB" "TEAM_BATTING_SO"
## [9] "TEAM_BASERUN_SB" "TEAM_PITCHING_H"
## [11] "TEAM_PITCHING_HR" "TEAM_PITCHING_BB"
## [13] "TEAM_PITCHING_SO" "TEAM_FIELDING_E"
## [15] "TEAM_FIELDING_DP" "TEAM_BATTING_SO_FLAG"
## [17] "TEAM_PITCHING_SO_FLAG" "TEAM_BASERUN_SB_FLAG"
## [19] "TEAM_BASERUN_CS_FLAG" "TEAM_BATTING_BB_FLAG"
## [21] "TEAM_BATTING_HBP_FLAG" "TEAM_FIELDING_DP_FLAG"
## [23] "TEAM_BASERUN_STEAL_ATTEMPTS" "TEAM_BATTING_WALK_TOTAL"
## Custom Aggregate with Transform, Making Better
model_2_dfa_transform <- moneyball_train_imputed |>
dplyr::select(TARGET_WINS,TEAM_BATTING_SO_FLAG,TEAM_PITCHING_SO_FLAG,TEAM_BATTING_BB_FLAG,TEAM_BATTING_HBP_FLAG,TEAM_FIELDING_DP_FLAG,TEAM_BASERUN_SB_FLAG,
TEAM_PITCHING_SO,TEAM_FIELDING_DP,TEAM_FIELDING_E,TEAM_BATTING_H,TEAM_BATTING_WALK_TOTAL,TEAM_BATTING_BB,TEAM_BASERUN_SB,TEAM_BATTING_SO) |>
mutate( DEFENSE_INDEX = ((TEAM_PITCHING_SO + TEAM_FIELDING_DP) / TEAM_FIELDING_E),## Shifting this to keep things positive, but maintain the error influence
OFFENSE_INDEX=(TEAM_BATTING_H+TEAM_BATTING_BB+TEAM_BASERUN_SB)) |>
dplyr::select(-TEAM_PITCHING_SO,-TEAM_FIELDING_DP,-TEAM_FIELDING_E,-TEAM_BATTING_H,-TEAM_BATTING_WALK_TOTAL,-TEAM_BATTING_BB,-TEAM_BASERUN_SB)
## Adding 1 to Target Wins Ver to use Box Cox
model2_2 <- lm(TARGET_WINS+1 ~ log(DEFENSE_INDEX) + log(OFFENSE_INDEX), data=model_2_dfa_transform)
print(summary(model2_2))
##
## Call:
## lm(formula = TARGET_WINS + 1 ~ log(DEFENSE_INDEX) + log(OFFENSE_INDEX),
## data = model_2_dfa_transform)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61.831 -9.038 0.316 8.871 59.607
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -489.8544 22.3094 -21.957 < 2e-16 ***
## log(DEFENSE_INDEX) 1.5386 0.3449 4.462 8.54e-06 ***
## log(OFFENSE_INDEX) 74.5013 2.9183 25.529 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.82 on 2271 degrees of freedom
## Multiple R-squared: 0.2281, Adjusted R-squared: 0.2275
## F-statistic: 335.6 on 2 and 2271 DF, p-value: < 2.2e-16
plot(model2_2)
hist(model2_2$resid)
qqnorm(model2_2$resid)
skewness(model2_2$resid) #-0.012
## [1] -0.01202184
## No Change X Vars: r^2 21.7% | F-score 317
## Log Transform: r^2 22.75% | F-score 335 [THIS MODEL IS BEST BY SMALL AMOUNT]
## SqRt: r^2 22.43% | F-score 329
## Squared: r^2 21.17% | F-score 317
## Cubed: r^2 21.77 | F-score 317
#sort(unique(moneyball_train_imputed$TEAM_BATTING_SO ))
#sort(unique(model_2_dfa_transform$DEFENSE_INDEX))
#sort(unique(model_2_dfa_transform$OFFENSE_INDEX))
bcx_model2_2<- boxcox(model2_2)
optimal_lambda <- bcx_model2_2$x[which.max(bcx_model2_2$y)]
print(optimal_lambda) ## 1.27
## [1] 1.272727
## Applying Box Cox to column
##REFERENCE: https://www.statology.org/box-cox-transformation-in-r/
model_2_dfa_transform$TARGET_WINS <- (((model_2_dfa_transform$TARGET_WINS+1)^optimal_lambda - 1) / optimal_lambda)
## Adjusted model w/ BoxCox
model2_bcx<- lm(TARGET_WINS ~ DEFENSE_INDEX + OFFENSE_INDEX, data=model_2_dfa_transform)
print(summary(model2_bcx))
##
## Call:
## lm(formula = TARGET_WINS ~ DEFENSE_INDEX + OFFENSE_INDEX, data = model_2_dfa_transform)
##
## Residuals:
## Min 1Q Median 3Q Max
## -188.094 -30.629 0.259 29.205 203.063
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -30.670467 9.926942 -3.090 0.00203 **
## DEFENSE_INDEX 1.152300 0.267844 4.302 1.76e-05 ***
## OFFENSE_INDEX 0.113744 0.004627 24.581 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45.68 on 2271 degrees of freedom
## Multiple R-squared: 0.2131, Adjusted R-squared: 0.2124
## F-statistic: 307.5 on 2 and 2271 DF, p-value: < 2.2e-16
plot(model2_bcx)
hist(model2_bcx$resid)
qqnorm(model2_bcx$resid)
skewness(model2_bcx$resid) ## 0.106
## [1] 0.1122622
## BOX COX Transform improves skewness, but lower the R^2 of the model.
### MODEL 3 Risk Index
# --- OTHER INCLUSION:
# TARGET_WINS
# TEAM_BATTING_BB_FLAG
# TEAM_BASERUN_CS_FLAG
# TEAM_BATTING_HBP_FLAG
# TEAM_BASERUN_SB_FLAG
#-- HIGH RISK OFFENSE:
# TEAM_BASERUN_STEAL_ATTEMPTS
# TEAM_BATTING_2B
# TEAM_BATTING_3B
#-- LOW RISK OFFENSE:
# TEAM_BATTING_1B
# TEAM_BATTING_WALK_TOTAL
## Custom Aggregate
model_3_dfa <- moneyball_train_imputed |>
mutate (TEAM_BATTING_1B = (TEAM_BATTING_H - (TEAM_BATTING_2B+TEAM_BATTING_3B+TEAM_BATTING_HR))) |>
dplyr::select( TARGET_WINS,TEAM_BATTING_BB_FLAG,TEAM_BASERUN_CS_FLAG, TEAM_BATTING_HBP_FLAG,TEAM_BASERUN_SB_FLAG,TEAM_BATTING_HR,TEAM_BATTING_1B,
TEAM_BATTING_WALK_TOTAL,TEAM_BATTING_2B,TEAM_BATTING_3B,TEAM_BASERUN_STEAL_ATTEMPTS) |>
mutate (HIGH_RISK = (TEAM_BASERUN_STEAL_ATTEMPTS+TEAM_BATTING_2B+TEAM_BATTING_3B),
LOW_RISK = (TEAM_BATTING_1B + TEAM_BATTING_WALK_TOTAL+TEAM_BATTING_HR)) |>
dplyr::select(-TEAM_BATTING_1B,-TEAM_BATTING_WALK_TOTAL,-TEAM_BATTING_2B,-TEAM_BATTING_3B,-TEAM_BASERUN_STEAL_ATTEMPTS,-TEAM_BATTING_HR)
model3_log <- lm(TARGET_WINS+1 ~ HIGH_RISK + LOW_RISK, data=model_3_dfa)
print(summary(model3_log))
##
## Call:
## lm(formula = TARGET_WINS + 1 ~ HIGH_RISK + LOW_RISK, data = model_3_dfa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.078 -9.081 0.197 9.110 56.827
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.664325 3.207541 2.389 0.017 *
## HIGH_RISK 0.018733 0.002900 6.459 1.29e-10 ***
## LOW_RISK 0.039120 0.002029 19.277 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.06 on 2271 degrees of freedom
## Multiple R-squared: 0.201, Adjusted R-squared: 0.2002
## F-statistic: 285.6 on 2 and 2271 DF, p-value: < 2.2e-16
plot(model3_log)
## Comfirmign the Optimal Lambda
bcx_model3<- boxcox(model3_log)
optimal_lambda <- bcx_model3$x[which.max(bcx_model3$y)]
print(optimal_lambda)
## [1] 1.353535
model_3a_transform<-model_3_dfa
model_3a_transform$TARGET_WINS <- (((model_3a_transform$TARGET_WINS+1)^optimal_lambda - 1) / optimal_lambda)
model3_trans <- lm(TARGET_WINS ~ HIGH_RISK^3 + LOW_RISK^3, data=model_3a_transform)
print(summary(model3_trans))
##
## Call:
## lm(formula = TARGET_WINS ~ HIGH_RISK^3 + LOW_RISK^3, data = model_3a_transform)
##
## Residuals:
## Min 1Q Median 3Q Max
## -236.38 -44.09 -1.20 42.45 300.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -50.056186 15.004017 -3.336 0.000863 ***
## HIGH_RISK 0.087089 0.013567 6.419 1.66e-10 ***
## LOW_RISK 0.178321 0.009493 18.785 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65.79 on 2271 degrees of freedom
## Multiple R-squared: 0.1938, Adjusted R-squared: 0.1931
## F-statistic: 273 on 2 and 2271 DF, p-value: < 2.2e-16
plot(model3_trans)
## DROPPING MODEL 3 B/C lower Fscore and lower R^2.
## CHOOSING MODEL 2
### Preparing Test Data using the same methodology of imputation etc as the training
## Adding imputation /missing data Flags to keep track of the changes:
## Present (P) and Missing (M)
moneyball_eval_data$TEAM_BATTING_SO_FLAG <- ifelse(is.na(moneyball_eval_data$TEAM_BATTING_SO), 'M', 'P')
moneyball_eval_data$TEAM_PITCHING_SO_FLAG <- ifelse(is.na(moneyball_eval_data$TEAM_PITCHING_SO), 'M', 'P')
moneyball_eval_data$TEAM_BASERUN_SB_FLAG <- ifelse(is.na(moneyball_eval_data$TEAM_BASERUN_SB), 'M', 'P')
moneyball_eval_data$TEAM_BASERUN_CS_FLAG <- ifelse(is.na(moneyball_eval_data$TEAM_BASERUN_CS), 'M', 'P')
moneyball_eval_data$TEAM_BATTING_BB_FLAG <- ifelse(is.na(moneyball_eval_data$TEAM_BATTING_BB), 'M', 'P')
moneyball_eval_data$TEAM_BATTING_HBP_FLAG <- ifelse(is.na(moneyball_eval_data$TEAM_BATTING_HBP), 'M', 'P')
moneyball_eval_data$TEAM_FIELDING_DP_FLAG <- ifelse(is.na(moneyball_eval_data$TEAM_FIELDING_DP), 'M', 'P')
moneyball_eval_data_processed <- moneyball_eval_data |>
mutate(TEAM_BASERUN_STEAL_ATTEMPTS = ifelse(is.na(TEAM_BASERUN_SB), 0, TEAM_BASERUN_SB) + ifelse(is.na(TEAM_BASERUN_CS), 0, TEAM_BASERUN_CS),
TEAM_BATTING_WALK_TOTAL = ifelse(is.na(TEAM_BATTING_BB), 0, TEAM_BATTING_BB) + ifelse(is.na(TEAM_BATTING_HBP), 0, TEAM_BATTING_HBP)) |>
dplyr::select(-TEAM_BASERUN_CS, -TEAM_BATTING_HBP) #-TEAM_BASERUN_SB,-TEAM_BATTING_BB
## Confirming no major outliers like in the original training set
sort(unique(moneyball_eval_data_processed$TEAM_PITCHING_SO))
## [1] 0 315 323 324 352 361 367 368 378 401 408 413 420 421 426
## [16] 430 436 437 446 462 466 468 471 478 479 489 504 509 511 520
## [31] 524 526 527 529 535 540 545 554 558 559 569 573 576 577 579
## [46] 583 588 595 597 599 600 604 613 619 621 622 623 625 627 630
## [61] 631 632 634 635 642 644 645 648 653 656 660 662 663 665 666
## [76] 667 672 673 675 681 684 693 696 698 703 705 709 712 714 715
## [91] 717 719 722 725 727 728 729 731 734 735 736 745 747 758 761
## [106] 777 779 780 781 782 785 790 792 794 802 805 811 815 816 817
## [121] 824 827 828 829 831 832 841 842 849 850 854 855 857 860 870
## [136] 871 873 878 889 893 902 903 906 909 912 913 914 917 921 926
## [151] 929 930 932 938 944 947 953 958 963 964 966 969 970 971 973
## [166] 976 977 980 984 990 999 1002 1006 1008 1011 1021 1025 1027 1030 1031
## [181] 1032 1041 1042 1054 1056 1060 1061 1071 1080 1084 1090 1092 1094 1105 1119
## [196] 1123 1142 1145 1170 1177 1231 1268 1295 1462 9963
## Looking at the columns that need imputation (4/5 columns need imputation that needed imputation in trianing)
summary(moneyball_eval_data_processed)
## INDEX TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## Min. : 9 Min. : 819 Min. : 44.0 Min. : 14.00
## 1st Qu.: 708 1st Qu.:1387 1st Qu.:210.0 1st Qu.: 35.00
## Median :1249 Median :1455 Median :239.0 Median : 52.00
## Mean :1264 Mean :1469 Mean :241.3 Mean : 55.91
## 3rd Qu.:1832 3rd Qu.:1548 3rd Qu.:278.5 3rd Qu.: 72.00
## Max. :2525 Max. :2170 Max. :376.0 Max. :155.00
##
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## Min. : 0.00 Min. : 15.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 44.50 1st Qu.:436.5 1st Qu.: 545.0 1st Qu.: 59.0
## Median :101.00 Median :509.0 Median : 686.0 Median : 92.0
## Mean : 95.63 Mean :499.0 Mean : 709.3 Mean :123.7
## 3rd Qu.:135.50 3rd Qu.:565.5 3rd Qu.: 912.0 3rd Qu.:151.8
## Max. :242.00 Max. :792.0 Max. :1268.0 Max. :580.0
## NA's :18 NA's :13
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## Min. : 1155 Min. : 0.0 Min. : 136.0 Min. : 0.0
## 1st Qu.: 1426 1st Qu.: 52.0 1st Qu.: 471.0 1st Qu.: 613.0
## Median : 1515 Median :104.0 Median : 526.0 Median : 745.0
## Mean : 1813 Mean :102.1 Mean : 552.4 Mean : 799.7
## 3rd Qu.: 1681 3rd Qu.:142.5 3rd Qu.: 606.5 3rd Qu.: 938.0
## Max. :22768 Max. :336.0 Max. :2008.0 Max. :9963.0
## NA's :18
## TEAM_FIELDING_E TEAM_FIELDING_DP TEAM_BATTING_SO_FLAG TEAM_PITCHING_SO_FLAG
## Min. : 73.0 Min. : 69.0 Length:259 Length:259
## 1st Qu.: 131.0 1st Qu.:131.0 Class :character Class :character
## Median : 163.0 Median :148.0 Mode :character Mode :character
## Mean : 249.7 Mean :146.1
## 3rd Qu.: 252.0 3rd Qu.:164.0
## Max. :1568.0 Max. :204.0
## NA's :31
## TEAM_BASERUN_SB_FLAG TEAM_BASERUN_CS_FLAG TEAM_BATTING_BB_FLAG
## Length:259 Length:259 Length:259
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## TEAM_BATTING_HBP_FLAG TEAM_FIELDING_DP_FLAG TEAM_BASERUN_STEAL_ATTEMPTS
## Length:259 Length:259 Min. : 0.0
## Class :character Class :character 1st Qu.: 88.5
## Mode :character Mode :character Median :137.0
## Mean :152.2
## 3rd Qu.:198.0
## Max. :580.0
##
## TEAM_BATTING_WALK_TOTAL
## Min. : 15.0
## 1st Qu.:443.0
## Median :512.0
## Mean :503.5
## 3rd Qu.:573.0
## Max. :792.0
##
# --- TEAM_PITCHING_SO,TEAM_BATTING_SO,TEAM_FIELDING_DP,TEAM_BASERUN_SB - Imputation with MICE PMM Needed. Probably need to remove the two outlier columns from the TEAM_PITCHING_SO column, and the data at large.
imputed <- mice(moneyball_eval_data_processed,m=5,meth='pmm',seed=5)
##
## iter imp variable
## 1 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 1 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 1 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 1 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 1 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO TEAM_FIELDING_DP
## Warning: Number of logged events: 7
moneyball_eval_imputed<-complete(imputed)
#Checking for nulls where there shouldnt be.
print(nrow(moneyball_eval_imputed |> filter(is.na(TEAM_PITCHING_SO))))
## [1] 0
## Working DF
print(head(moneyball_eval_imputed))
## INDEX TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 1 9 1209 170 33 83
## 2 10 1221 151 29 88
## 3 14 1395 183 29 93
## 4 47 1539 309 29 159
## 5 60 1445 203 68 5
## 6 63 1431 236 53 10
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_H
## 1 447 1080 62 1209
## 2 516 929 54 1221
## 3 509 816 59 1395
## 4 486 914 148 1539
## 5 95 416 52 3902
## 6 215 377 55 2793
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## 1 83 447 1080 140
## 2 88 516 929 135
## 3 93 509 816 156
## 4 159 486 914 124
## 5 14 257 1123 616
## 6 20 420 736 572
## TEAM_FIELDING_DP TEAM_BATTING_SO_FLAG TEAM_PITCHING_SO_FLAG
## 1 156 P P
## 2 164 P P
## 3 153 P P
## 4 154 P P
## 5 130 P P
## 6 105 P P
## TEAM_BASERUN_SB_FLAG TEAM_BASERUN_CS_FLAG TEAM_BATTING_BB_FLAG
## 1 P P P
## 2 P P P
## 3 P P P
## 4 P P P
## 5 M M P
## 6 M M P
## TEAM_BATTING_HBP_FLAG TEAM_FIELDING_DP_FLAG TEAM_BASERUN_STEAL_ATTEMPTS
## 1 M P 112
## 2 M P 93
## 3 M P 106
## 4 P P 205
## 5 M P 0
## 6 M P 0
## TEAM_BATTING_WALK_TOTAL
## 1 447
## 2 516
## 3 509
## 4 528
## 5 95
## 6 215
# Predicting TARGET_WINS using Model 2 (Box Cox) improved model with the eval data.
moneyball_eval_formatted <- moneyball_eval_imputed |>
dplyr::select(TEAM_BATTING_SO_FLAG,TEAM_PITCHING_SO_FLAG,TEAM_BATTING_BB_FLAG,TEAM_BATTING_HBP_FLAG,TEAM_FIELDING_DP_FLAG,TEAM_BASERUN_SB_FLAG,
TEAM_PITCHING_SO,TEAM_FIELDING_DP,TEAM_FIELDING_E,TEAM_BATTING_H,TEAM_BATTING_WALK_TOTAL,TEAM_BATTING_BB,TEAM_BASERUN_SB,TEAM_BATTING_SO) |>
mutate( DEFENSE_INDEX = ((TEAM_PITCHING_SO + TEAM_FIELDING_DP) / TEAM_FIELDING_E),## Shifting this to keep things positive, but maintain the error influence
OFFENSE_INDEX=(TEAM_BATTING_H+TEAM_BATTING_BB+TEAM_BASERUN_SB)) |>
dplyr::select(-TEAM_PITCHING_SO,-TEAM_FIELDING_DP,-TEAM_FIELDING_E,-TEAM_BATTING_H,-TEAM_BATTING_WALK_TOTAL,-TEAM_BATTING_BB,-TEAM_BASERUN_SB)
predictions <- predict(model2_2, newdata=moneyball_eval_formatted) # Predict on TestData
## PREDICTED WINS
predicted_win_data <- cbind(moneyball_eval_imputed, PRED_TARGET_WINS = predictions)
head(predicted_win_data)
## INDEX TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 1 9 1209 170 33 83
## 2 10 1221 151 29 88
## 3 14 1395 183 29 93
## 4 47 1539 309 29 159
## 5 60 1445 203 68 5
## 6 63 1431 236 53 10
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_H
## 1 447 1080 62 1209
## 2 516 929 54 1221
## 3 509 816 59 1395
## 4 486 914 148 1539
## 5 95 416 52 3902
## 6 215 377 55 2793
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## 1 83 447 1080 140
## 2 88 516 929 135
## 3 93 509 816 156
## 4 159 486 914 124
## 5 14 257 1123 616
## 6 20 420 736 572
## TEAM_FIELDING_DP TEAM_BATTING_SO_FLAG TEAM_PITCHING_SO_FLAG
## 1 156 P P
## 2 164 P P
## 3 153 P P
## 4 154 P P
## 5 130 P P
## 6 105 P P
## TEAM_BASERUN_SB_FLAG TEAM_BASERUN_CS_FLAG TEAM_BATTING_BB_FLAG
## 1 P P P
## 2 P P P
## 3 P P P
## 4 P P P
## 5 M M P
## 6 M M P
## TEAM_BATTING_HBP_FLAG TEAM_FIELDING_DP_FLAG TEAM_BASERUN_STEAL_ATTEMPTS
## 1 M P 112
## 2 M P 93
## 3 M P 106
## 4 P P 205
## 5 M P 0
## 6 M P 0
## TEAM_BATTING_WALK_TOTAL PRED_TARGET_WINS
## 1 447 68.45039
## 2 516 71.41741
## 3 509 77.84143
## 4 528 85.91625
## 5 95 60.51704
## 6 215 64.95147
#write.csv(predicted_win_data, "TARGET_WINS_Prediction_moneyball_data.csv")
### NO Actual TARGET_WINS in the test / eval data, so cant demonstrate accuracy.
## NOT USING BECAUSE INSTRUCTIONS SAID TO MANUALLY SELECT VARIABLES
#Pivoting to the technique used in Week 5 Content videos, as none of the models I manually thought of were great.
#install.packages("leaps")
#library(leaps)
## Limiting to certain variables, and keeping only one version of each of the directly linearly related variables.
#moneyball_train_imputed_lim <-moneyball_train_imputed |>
# mutate(TEAM_BATTING_1B = TEAM_BATTING_H - (TEAM_BATTING_2B+TEAM_BATTING_3B+TEAM_BATTING_HR)) |>
# dplyr::select(TARGET_WINS,TEAM_BATTING_1B,TEAM_BATTING_2B,TEAM_BATTING_2B,TEAM_BATTING_HR,TEAM_PITCHING_SO,
# TEAM_BATTING_WALK_TOTAL,TEAM_BASERUN_STEAL_ATTEMPTS,TEAM_FIELDING_E,TEAM_FIELDING_DP)
#moneyball_train_imputed_lim
## Limited df for the ModelSelection Route.
#regfit.full = regsubsets(TARGET_WINS~.,data=moneyball_train_imputed_lim, nvmax=9)
#reg.summary<-summary(regfit.full)
### Plottign the CP in order to pick the ideal number of variables
#plot(reg.summary$cp, xlab="Number of Variables", ylab="Cp")
## Only 9 var ,so lower may be higher vars. Confirming:
#which.min(reg.summary$cp)
## Model number 6 is ideal.Looking at the other CP chart from video
# Looking and 7 is actually the lowest for
#plot(regfit.full, scale="Cp")
#coef(regfit.full, 6)